Pesquisa | Portal Regional da BVS

1.

Is it quality, is it redundancy, or is model inadequacy? Some strategies for judging the ap-propriateness of high-discrimination items / ¿Es calidad, redundancia o inadecuación del modelo? Algunas estrategias para determinar la idoneidad de los ítems altamente discriminativos

Pere, J. Ferrando; Morales-Vives, Fabia.

An. psicol ; 39(3): 517-527, Oct-Dic, 2023. ilus, tab, graf

Artigo em Inglês | IBECS | ID: ibc-224953

RESUMO

Cuando se desarrollan nuevos cuestionarios, tradicionalmente se asume que los ítems deben ser lo más discriminativos posible, como si esto fuera siempre indicativo de su calidad. Pero en algunos casos estas discriminaciones elevadas pueden estar ocultando algunos problemas como redundancias, residuales compartidos, distribuciones sesgadas o limitaciones del modelo que pueden contribuir a inflar las estimaciones de la discriminación. Por lo tanto, la inspección de estos índices puede llevar a decisiones erróneas sobre qué ítems mantener o eliminar. Para ilustrar este problema, se describen dos escenarios diferentes con datos reales. El primero se centra en un cuestionario que contiene un ítem aparentemente muy discriminante, pero redundante. El segundo se centra en un cuestionario clínico administrado a una muestra comunitaria, lo que da lugar a distribuciones de respuesta de los ítems muy sesgadas y a índices de discriminación inflados, a pesar de que los ítems no discriminan bien entre la mayoría de los sujetos. Proponemos algunas estrategias y comprobaciones para identificar estas situaciones, para facilitar la identificación y eliminación de los ítems inapropiados. Por lo tanto, este artículo pretende promover una actitud crítica, que puede implicar ir en contra de los principios rutinarios establecidos cuando no son apropiados.(AU)

When developing new questionnaires, it is traditionally assumed that the items should be as discriminative as possible, as if this was always indicative of their quality. However, in some cases these high discrimina-tions may be masking some problems such as redundancies, shared residu-als, biased distributions, or model limitations which may contribute to in-flate the discrimination estimates. Therefore, the inspection of these indi-ces may lead to erroneous decisions about which items to keep or elimi-nate. Toillustrate this problem, two different scenarios with real data are described. The first focuses on a questionnaire that contains an item ap-parently highly discriminant, but redundant. The second focuses on a clini-cal questionnaire administered to a community sample, which gives place to highly right-skewed item response distributions and inflated discrimi-nant indices, despite the items do not discriminate well among the majority of participants. We propose some strategies and checks to identify these situations, so that the items that are inappropriate may be identified and removed. Therefore, this article seeks to promote a critical attitude, which may involve going against routine stablished principles when they are not appropriate.(AU)

Assuntos

Humanos , Inquéritos e Questionários/classificação , Inquéritos e Questionários/estatística & dados numéricos , Análise Fatorial , Psicometria/métodos , Psicometria/estatística & dados numéricos

2.

Response Category Functioning on the Health Care Engagement Measure Using the Nominal Response Model.

Reise, Steven P; Hubbard, Anne S; Wong, Emily F; Schalet, Benjamin D; Haviland, Mark G; Kimerling, Rachel.

Assessment ; 30(2): 375-389, 2023 03.

Artigo em Inglês | MEDLINE | ID: mdl-34706571

RESUMO

As part of a scale development project, we fit a nominal response item response theory model to responses to the Health Care Engagement Measure (HEM). When using the original 5-point response format, categories were not ordered as intended for six of the 23 items. For the remaining, the category boundary discrimination between Categories 0 (not at all true) and 1 (a little bit true) was only weakly discriminating, suggesting uninformative categories. When the lowest two categories were collapsed, psychometric properties improved greatly. Category boundary discriminations within items, however, varied significantly. Specifically, higher response category distinctions, such as responding 3 (very true) versus 2 (mostly true) were considerably more discriminating than lower response category distinctions. Implications for HEM scoring and for improving measurement precision at lower levels of the construct are presented as is the unique role of the nominal response model in category analysis.

Assuntos

Psicometria , Humanos , Inquéritos e Questionários

3.

Developing valid test bank of surveillance case study scenarios for inter-rater reliability.

Holmes, Kelly; Moinuddin, Mishga; Steinfeld, Sandi.

Am J Infect Control ; 50(8): 960-962, 2022 08.

Artigo em Inglês | MEDLINE | ID: mdl-35158010

RESUMO

Case studies are utilized for training on National Healthcare Safety Network (NHSN) healthcare associated infection surveillance definitions. Item discrimination and item analysis were applied to case studies to identify questions that most accurately assess infection preventionists (IPs) application of surveillance definitions.

Assuntos

Infecção Hospitalar , Infecção Hospitalar/epidemiologia , Infecção Hospitalar/prevenção & controle , Confiabilidade dos Dados , Instalações de Saúde , Humanos , Reprodutibilidade dos Testes

4.

A Signal Detection Model for Multiple-Choice Exams.

DeCarlo, Lawrence T.

Appl Psychol Meas ; 45(6): 423-440, 2021 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-34565945

RESUMO

A model for multiple-choice exams is developed from a signal-detection perspective. A correct alternative in a multiple-choice exam can be viewed as being a signal embedded in noise (incorrect alternatives). Examinees are assumed to have perceptions of the plausibility of each alternative, and the decision process is to choose the most plausible alternative. It is also assumed that each examinee either knows or does not know each item. These assumptions together lead to a signal detection choice model for multiple-choice exams. The model can be viewed, statistically, as a mixture extension, with random mixing, of the traditional choice model, or similarly, as a grade-of-membership extension. A version of the model with extreme value distributions is developed, in which case the model simplifies to a mixture multinomial logit model with random mixing. The approach is shown to offer measures of item discrimination and difficulty, along with information about the relative plausibility of each of the alternatives. The model, parameters, and measures derived from the parameters are compared to those obtained with several commonly used item response theory models. An application of the model to an educational data set is presented.

5.

A Multidimensional Item Response Theory Model for Continuous and Graded Responses With Error in Persons and Items.

Ferrando, Pere J; Navarro-González, David.

Educ Psychol Meas ; 81(6): 1029-1053, 2021 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-34552274

RESUMO

Item response theory "dual" models (DMs) in which both items and individuals are viewed as sources of differential measurement error so far have been proposed only for unidimensional measures. This article proposes two multidimensional extensions of existing DMs: the M-DTCRM (dual Thurstonian continuous response model), intended for (approximately) continuous responses, and the M-DTGRM (dual Thurstonian graded response model), intended for ordered-categorical responses (including binary). A rationale for the extension to the multiple-content-dimensions case, which is based on the concept of the multidimensional location index, is first proposed and discussed. Then, the models are described using both the factor-analytic and the item response theory parameterizations. Procedures for (a) calibrating the items, (b) scoring individuals, (c) assessing model appropriateness, and (d) assessing measurement precision are finally discussed. The simulation results suggest that the proposal is quite feasible, and an illustrative example based on personality data is also provided. The proposals are submitted to be of particular interest for the case of multidimensional questionnaires in which the number of items per scale would not be enough for arriving at stable estimates if the existing unidimensional DMs were fitted on a separate-scale basis.

6.

Reliability and External Validity of Personality Test Scores: The Role of Person and Item Error / Fiabilidad y Validez Externa de las Puntuaciones en los Tests de Personalidad: El Papel del Error en las Personas y en los Ítems

Ferrando, Pere J; Navarro-González, David.

Psicothema (Oviedo) ; 33(2): 259-267, 2021. graf

Artigo em Inglês | IBECS | ID: ibc-225503

RESUMO

Background: This article explores the suitability of a proposed Dual model, in which both people and items are sources of measurement error, by assessing how the test scores are expected to behave in terms of marginal reliability and external validity when the model holds. Method: Analytical derivations are produced for predicting: (a) the impact of person and item errors in the amount of marginal reliability and external validity, as well as the occurrence of ceiling effects; (b) the changes in test reliability across groups with different average amounts of person error, and (c) the phenomenon of differential predictability. Two empirical studies are also used both as an illustration and as a check of the predicted results. Results: Results show that the model-based predictions agree with existing evidence as well as with basic principles in classical test theory. However, the additional inclusion of individuals as a source of error leads to new explanations and predictions. Conclusions: The proposal and results provide new sources of information in personality assessment as well as of evidence of model suitability. They also help to explain some disappointing recurrent results. (AU)

Antecedentes: se explora la adecuación de un modelo Dual en el que ítems y personas son fuente de error de medida, evaluando como se espera que se comporten las puntuaciones en un test en términos de fiabilidad y validez cuando el modelo se cumple. Método: se derivan analíticamente predicciones respecto a: (a) el impacto del error en personas y en ítems en las estimaciones de fiabilidad y validez externa, así como en efectos techo esperados, (b) cambios en la fiabilidad marginal en grupos con diferente magnitud media de error individual, y (c) el fenómeno de la predictibilidad diferencial. Se incluyen dos estudios empíricos a efectos de ilustración y verificación empírica. Resultados: las predicciones concuerdan con la evidencia acumulada y con los principios de la teoría clásica del test. Sin embargo, la inclusión del parámetro de error individual permite llegar a nuevas explicaciones y predicciones. Conclusiones: la propuesta y resultados proporcionan nuevas fuentes de información en la medida de la personalidad, así como evidencia de la adecuación del modelo. También explican algunos resultados decepcionantes y recurrentes. (AU)

Assuntos

Humanos , Testes de Personalidade/estatística & dados numéricos , Reprodutibilidade dos Testes , Valor Preditivo dos Testes

7.

Rethinking the Interpretation of Item Discrimination and Factor Loadings.

Jordan, Pascal; Spiess, Martin.

Educ Psychol Meas ; 79(6): 1103-1132, 2019 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-31619841

RESUMO

Factor loadings and item discrimination parameters play a key role in scale construction. A multitude of heuristics regarding their interpretation are hardwired into practice-for example, neglecting low loadings and assigning items to exactly one scale. We challenge the common sense interpretation of these parameters by providing counterexamples and general results which altogether cast doubt on our understanding of these parameters. In particular, we highlight the counterintuitive way in which the best prediction of a test taker's latent ability depends on the factor loadings. As a consequence, we emphasize that practitioners need to shift their focus from interpreting item discrimination parameters by their relative loading to an interpretation which incorporates the structure of the model-based latent ability estimate.

8.

A Comprehensive IRT Approach for Modeling Binary, Graded, and Continuous Responses With Error in Persons and Items.

Ferrando, Pere J.

Appl Psychol Meas ; 43(5): 339-359, 2019 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-31235981

RESUMO

Dual item response theory (IRT) models in which items and individuals have different amounts of measurement error have been proposed in the literature. Any developments in these models, however, are feasible only for continuous responses. This article discusses a comprehensive dual modeling approach, based on underlying latent response variables, from which specific models for continuous, graded, and binary responses are obtained. Procedures for (a) calibrating the items, (b) scoring individuals, (c) assessing model appropriateness, and (d) assessing measurement precision are discussed for all the resulting models. Simulation results suggest that the proposal is quite feasible. A practical illustration is given with an empirical example in the personality domain.

9.

Effectiveness of e-Learning in a Medical School 2.0 Model: Comparison of Item Analysis for Student-Generated vs. Faculty-Generated Multiple-Choice Questions.

Janzen, Bryan W; Sommerfeld, Connor; Gooi, Adrian C C.

Stud Health Technol Inform ; 257: 184-188, 2019.

Artigo em Inglês | MEDLINE | ID: mdl-30741193

RESUMO

BACKGROUND: Early reports in the literature describe using student-generated questions as a method of student learning as well as augmenting question exam banks. Reports on the performance of student-generated questions versus faculty-generated questions, however, remain limited. This study aims to compare the question performance of student-generated versus faculty-generated multiple-choice questions (MCQ). OBJECTIVES: To determine if student-generated questions using mobile audience response systems and online discussion boards have similar item discrimination scores as faculty-generated questions. METHODS: A team-based learning session was used to create 113 student-generated multiple-choice questions (SGQs). A 20 question MCQ quiz was presented to a second year medical school class made of 10 randomly selected SGQs and 10 randomly selected faculty-generated multiple-choice questions (FGQs). Item analysis was performed on the test results. RESULTS: The data showed no statistical difference in the point-biserial scores between the two groups (average point-biserial 0.31 students vs 0.36 faculty, p=0.14), with 90% of student-generated and 100% of faculty-generated questions meeting a cut-off of point-biserial score >0.2. Interestingly, student-generated questions were statistically more difficult than the faculty-generated questions (Item Difficulty score 0.46 students vs 0.69 faculty, p=0.003). CONCLUSIONS: This study suggests that student-generated compared to faculty-generated MCQs have similar item discrimination scores, but are perhaps more difficult questions.

Assuntos

Instrução por Computador , Faculdades de Medicina , Estudantes de Medicina , Instrução por Computador/normas , Avaliação Educacional , Docentes , Humanos , Aprendizagem

10.

Item-Score Reliability in Empirical-Data Sets and Its Relationship With Other Item Indices.

Zijlmans, Eva A O; Tijmstra, Jesper; van der Ark, L Andries; Sijtsma, Klaas.

Educ Psychol Meas ; 78(6): 998-1020, 2018 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-30542214

RESUMO

Reliability is usually estimated for a total score, but it can also be estimated for item scores. Item-score reliability can be useful to assess the repeatability of an individual item score in a group. Three methods to estimate item-score reliability are discussed, known as method MS, method λ 6 , and method CA. The item-score reliability methods are compared with four well-known and widely accepted item indices, which are the item-rest correlation, the item-factor loading, the item scalability, and the item discrimination. Realistic values for item-score reliability in empirical-data sets are monitored to obtain an impression of the values to be expected in other empirical-data sets. The relation between the three item-score reliability methods and the four well-known item indices are investigated. Tentatively, a minimum value for the item-score reliability methods to be used in item analysis is recommended.

11.

Examining the impact of specific types of item-writing flaws on student performance and psychometric properties of the multiple choice question.

Pham, Hannah; Besanko, James; Devitt, Peter.

MedEdPublish (2016) ; 7: 225, 2018.

Artigo em Inglês | MEDLINE | ID: mdl-38089249

RESUMO

This article was migrated. The article was marked as recommended. Background: Item-writing flaws (IWFs) are common in multiple choice questions (MCQs) despite item-writing guidelines. Previous studies have shown that IWFs impact validity as observed through student performance, item difficulty, and discrimination. Most previous studies have examined IWFs collectively and have shown that they have a diverse impact. The aim of the study was to determine if the effects of individual types of IWFs are systematic and predictable. Method: A cross-over study design was used. 100 pairs of MCQ items (with and without an IWF) were constructed to test 10 types of IWFs. Medical students were invited to participate in a mock examination. Paper A consisted of 50 flawed followed by 50 unflawed items. Paper B consisted of 50 unflawed followed by 50 flawed items. The effect of each of the IWFs on mean item scores, item difficulty and discrimination were examined. Results: The hypothesised effect of IWFs on mean item scores was confirmed in only 4 out of 10 cases. ' Longest choice is correct', 'Clues to the right answer (Eponymous terms)' and ' Implausible distractors' positively impacted, while 'Central idea in choices rather than stem' negatively impacted mean item scores. Other flaws had either the opposite or no statistically significant effect. IWFs did not impact item difficulty or discrimination. Conclusion: The effect of IWFs is neither systematic nor predictable. Unpredictability in assessment produces error and thus loss of validity. Therefore, IWFs should be avoided. Faculties should be encouraged to invest in item-writing workshops in order to improve MCQs. However, the cost of doing so should be carefully weighed against the benefits of developing programmes of assessment.

12.

Unidimensional IRT Item Parameter Estimates Across Equivalent Test Forms With Confounding Specifications Within Dimensions.

Matlock, Ki Lynn; Turner, Ronna.

Educ Psychol Meas ; 76(2): 258-279, 2016 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-29795865

RESUMO

When constructing multiple test forms, the number of items and the total test difficulty are often equivalent. Not all test developers match the number of items and/or average item difficulty within subcontent areas. In this simulation study, six test forms were constructed having an equal number of items and average item difficulty overall. Manipulated variables were the number of items and average item difficulty within subsets of items primarily measuring one of two dimensions. Data sets were simulated at four levels of correlation (0, .3, .6, and .9). Item parameters were estimated using the Rasch and two-parameter logistic unidimensional item response theory models. Estimated discrimination and difficulty were compared across forms and to the true item parameters. The average unidimensional estimated discrimination was consistent across forms having the same correlation. Forms having a larger set of easy items measuring one dimension were estimated as being more difficult than forms having a larger set of hard items. Estimates were also investigated within subsets of items, and measures of bias were reported. This study encourages test developers to not only maintain consistent test specifications across forms as a whole but also within subcontent areas.

13.

O impacto da mudança no critério de acerto na distribuição dos escores do subteste de leitura do teste de desempenho escolar / El influjo del cambio de criterio de acierto en la distribución de los puntajes de la subprueba de lectura en la prueba de rendimiento escolar / The impact of the change in the criterion of scoring correct responses in the distribution of the scores of the reading subtest of performance school test

Lúcio, Patrícia Silva; Pinheiro, Ângela M. V; Nascimento, Elizabeth do.

Psicol. estud ; 14(3): 593-601, jul.-set. 2009. tab, ilus

Artigo em Português | Index Psicologia - Periódicos | ID: psi-51831

RESUMO

O estudo investiga o impacto da introdução de uma nova classe de erros no subteste de leitura do Teste de Desempenho Escolar na distribuição dos escores. Participaram do estudo 306 crianças de 1ª a 4ª séries da rede de ensino de Belo Horizonte. Para cada participante foram conferidos dois escores: um estabelecido a partir de critério do manual (EB1) e o outro a partir do critério adotado no presente estudo (EB2) - que considerou incorretas as respostas de silabação e de correção espontânea. Os resultados mostraram que a adoção do critério EB2 tornou o teste mais discriminativo, mas não impediu o aparecimento de um efeito de teto, nem proporcionou uma distribuição normal dos escores. As análises mostraram que, além do critério permissivo, o excesso de palavras fáceis e a escassez de palavras difíceis prejudicaram a variabilidade dos escores. Hipóteses sobre a dificuldade das palavras são discutidas à luz da psicolinguística.(AU)

The study investigates the impact of the introduction of a new class of errors within the original error set of the reading subtest of the Performance School Test on the distribution of the scores. The sample consisted of 306 1st - 4th grades of the school system of Belo Horizonte who participated in the study. For each child two scores were given: one based on the manual criteria (EB1) and the other on the criteria introduced in the present study (EB2) - which considered explicit sounding out of syllables or of part of the stimuli and self-corrections as incorrect responses. The results showed that the choice of the EB2 made the test more discriminative, but such a modification did not prevent the emergence of a ceiling effect, nor did it enable a normal distribution of the scores. The analysis showed that apart from the permissive criteria the excess of easy words and the shortage of difficult words affected the variability of the scores. Hypotheses about the difficulty of words were discussed in light of the psycholinguistics.(AU)

El estudio investiga el influjo que la introducción de una nueva clase de errores en la subprueba de lectura de la Prueba de Rendimiento Escolar tuvo en la distribución de los puntajes de la referida subprueba. Participaron de la investigación 306 niños de 1ª a 4ª clase de la Red de Enseñanza de Belo Horizonte. Para cada participante han sido conferidos dos puntajes: uno establecido a partir de criterio del manual (EB1) y otro a partir del criterio adoptado en la presente investigación (EB2) - que consideró incorrectas las respuestas de silabación y de corrección espontánea. Los resultados han mostrado que la adopción del criterio EB2 tornó la prueba más discriminadora, pero esa modificación no impidió el asomo de un efecto de techo, ni proporcionó una distribución normal de los puntajes. Los análisis han mostrado que, además del criterio permisivo, el exceso de palabras fáciles y la escasez de palabras difíciles perjudicaron la variabilidad de los puntajes. Hipótesis sobre la dificultad de las palabras son discutidas a la luz de la psicolingüística.(AU)

Assuntos

Humanos , Masculino , Feminino , Criança , Leitura , Psicolinguística

14.

O impacto da mudança no critério de acerto na distribuição dos escores do subteste de leitura do teste de desempenho escolar / El influjo del cambio de criterio de acierto en la distribución de los puntajes de la subprueba de lectura en la prueba de rendimiento escolar / The impact of the change in the criterion of scoring correct responses in the distribution of the scores of the reading subtest of performance school test

Lúcio, Patrícia Silva; Pinheiro, Ângela M. V; Nascimento, Elizabeth do.

Psicol. estud ; 14(3): 593-601, jul.-set. 2009. tab, ilus

Artigo em Português | LILACS | ID: lil-537000

RESUMO

O estudo investiga o impacto da introdução de uma nova classe de erros no subteste de leitura do Teste de Desempenho Escolar na distribuição dos escores. Participaram do estudo 306 crianças de 1ª a 4ª séries da rede de ensino de Belo Horizonte. Para cada participante foram conferidos dois escores: um estabelecido a partir de critério do manual (EB1) e o outro a partir do critério adotado no presente estudo (EB2) - que considerou incorretas as respostas de silabação e de correção espontânea. Os resultados mostraram que a adoção do critério EB2 tornou o teste mais discriminativo, mas não impediu o aparecimento de um efeito de teto, nem proporcionou uma distribuição normal dos escores. As análises mostraram que, além do critério permissivo, o excesso de palavras fáceis e a escassez de palavras difíceis prejudicaram a variabilidade dos escores. Hipóteses sobre a dificuldade das palavras são discutidas à luz da psicolinguística.

The study investigates the impact of the introduction of a new class of errors within the original error set of the reading subtest of the Performance School Test on the distribution of the scores. The sample consisted of 306 1st - 4th grades of the school system of Belo Horizonte who participated in the study. For each child two scores were given: one based on the manual criteria (EB1) and the other on the criteria introduced in the present study (EB2) - which considered explicit sounding out of syllables or of part of the stimuli and self-corrections as incorrect responses. The results showed that the choice of the EB2 made the test more discriminative, but such a modification did not prevent the emergence of a ceiling effect, nor did it enable a normal distribution of the scores. The analysis showed that apart from the permissive criteria the excess of easy words and the shortage of difficult words affected the variability of the scores. Hypotheses about the difficulty of words were discussed in light of the psycholinguistics.

El estudio investiga el influjo que la introducción de una nueva clase de errores en la subprueba de lectura de la Prueba de Rendimiento Escolar tuvo en la distribución de los puntajes de la referida subprueba. Participaron de la investigación 306 niños de 1ª a 4ª clase de la Red de Enseñanza de Belo Horizonte. Para cada participante han sido conferidos dos puntajes: uno establecido a partir de criterio del manual (EB1) y otro a partir del criterio adoptado en la presente investigación (EB2) - que consideró incorrectas las respuestas de silabación y de corrección espontánea. Los resultados han mostrado que la adopción del criterio EB2 tornó la prueba más discriminadora, pero esa modificación no impidió el asomo de un efecto de techo, ni proporcionó una distribución normal de los puntajes. Los análisis han mostrado que, además del criterio permisivo, el exceso de palabras fáciles y la escasez de palabras difíciles perjudicaron la variabilidad de los puntajes. Hipótesis sobre la dificultad de las palabras son discutidas a la luz de la psicolingüística.

Assuntos

Humanos , Masculino , Feminino , Criança , Psicolinguística , Leitura

15.

Several Parameters and Their Usages in Software System for Paper Quality Analysis / 中华医学教育探索杂志

Kaocong TIAN; Bin PENG.

Chinese Journal of Medical Education Research ; (12)2003.

Artigo em Chinês | WPRIM (Pacífico Ocidental) | ID: wpr-622554

RESUMO

Degree of Confidence(?),Validity(E),Item Difficulty(P) and Item Discrimination(D) used in Software System for Paper Quality Analysis play an important role in evaluating the qualities of items and papers.Their usages and meanings are discussed in detail here,and an example is given to illustrate their significance and important roles in items analysis and papers analysis.

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA